Abstract

This paper presents a new version of Dropout called Split Dropout (sDropout) and rotational convolution techniques to improve CNNs' performance on image classification. The widely used standard Dropout prevents deep neural networks from overfitting by randomly dropping units during training. Our sDropout randomly splits the data into two subsets and keeps both rather than discarding one. We also introduce two rotational convolution techniques, rotate-pooling convolution (RPC) and flip-rotate-pooling convolution (FRPC), to make CNNs more robust to rotation transformations. These two techniques encode rotation invariance into the network without adding extra parameters. Experimental evaluations on the ImageNet 2012 classification task demonstrate that sDropout not only enhances performance but also converges faster. Additionally, RPC and FRPC make CNNs more robust to rotation transformations. Overall, FRPC together with sDropout brings a 1.18% accuracy increase (model of Zeiler and Fergus [24], 10-view, top-1) on the ImageNet 2012 classification task compared to the original network.


1. Introduction 

Since they were proposed by LeCun et al. [12] in the early 1990s, convolutional neural networks (CNNs) have demonstrated excellent performance in visual tasks such as image classification and object recognition. Recently, deep convolutional neural networks have shown state-of-the-art performance in challenging large-scale image classification. Krizhevsky et al. [9] show record-beating performance on the ImageNet 2012 classification benchmark [17], with their convnet model achieving an error rate of 16.4%, compared to the 2nd place result of 26.1%. The recent work of He et al. [5] reports, for the first time, a classification accuracy exceeding what humans can achieve.


Several factors contribute to the success of deep CNNs in image classification: the availability of large-scale labeled training data, powerful GPU support and effective regularization strategies. When large datasets are available, CNNs can extract effective features from the images by using more layers and more units in the networks. However, overfitting is a serious problem in networks with such a large number of parameters. Fortunately, a wide range of regularization techniques has been developed, such as adding an L2 penalty on the network weights, Bayesian methods, weight elimination and early stopping of training. Among these techniques, the Dropout proposed by Hinton et al. [6] is an effective way not only to reduce overfitting but also to give great improvements on many benchmark tasks.

Additionally, the two key concepts of CNNs, i.e. local receptive fields and weight-tying, help the networks encode translational invariance into the features learned from images. CNNs take translated versions of the basis function (the convolution filters) and pool over them. In this way, different image locations share the same basis function and thus the number of parameters to be learned is significantly reduced. By pooling over neighboring units, CNNs hard-code translation invariance into the model. However, this strategy prevents the pooling units from capturing more complex invariance, such as rotation invariance. The experiments of Zeiler and Fergus [24] validate that the outputs of CNNs are not invariant to rotation transformation.

In this paper, we improve CNNs in two aspects: a new version of Dropout and better rotation invariance. The core of the standard Dropout proposed by Hinton et al. [6] is to randomly drop half of the units in each fully-connected layer during forward propagation and to backpropagate only through the other half. This prevents units from co-adapting too much, but only half of the parameters are trained in every training iteration. Here, we propose a new version, called Split Dropout (sDropout), which splits the units into two subsets rather than dropping half of them. In this way, all the weights are trained while the co-adaptation between units is still broken. sDropout can be a substitute for Dropout in any circumstances. To add rotation invariance to the model, we introduce two rotational convolution techniques, i.e. rotate-pooling convolution (RPC) and flip-rotate-pooling convolution (FRPC), in the convolution layers of CNNs. By rotating or flipping the convolution filters, and then applying convolution and max-pooling, the modified CNNs are better able to extract rotation- and flip-invariant features.

We validate our methods on the large-scale 1000-class ImageNet 2012 classification dataset [17]. Our model is based on the architecture of Zeiler and Fergus [24] with Rectified Linear Unit (ReLU) [15] and Parametric Rectified Linear Unit (PReLU) [5] activations respectively. By replacing Dropout with sDropout, we observe not only lower testing error but also faster convergence in training. Additionally, our RPC and FRPC improve the classification accuracy and demonstrate rotation invariance on rotated images. We note that sDropout and the rotational convolution techniques do not change the original architecture and add very little memory and computation cost.

2. Related Work 

In the last few years, we have witnessed tremendous performance improvements in CNNs. These improvements are mainly of two types: building more powerful architectures and designing effective regularization techniques. To better fit training datasets, especially large-scale ones, models are made deeper and larger, e.g., the work of Simonyan et al. [20] and Szegedy et al. [21]. Using smaller strides to capture more image information also helps, as in the work of Zeiler and Fergus [24]. Rectified Linear Unit (ReLU) [15] and Parametric Rectified Linear Unit (PReLU) [5] are recent successful improvements in activation functions. On the other hand, regularization is an important aspect of boosting the testing performance of CNNs. Data augmentation [9, 7] and the Dropout [6] technique are currently the common ways to regularize a model. One major contribution of this paper is to improve Dropout.

In standard Dropout [10], each element of a layer's output is kept with probability p, and otherwise set to 0 with probability (1 - p) (usually p = 0.5). It can be seen as a stochastic regularization technique. Since the successful application of Dropout in feedforward neural networks for speech and object recognition [10], several works have improved and analyzed this technique. A generalization of Dropout, 'standout', proposed by Ba et al. [1], uses a binary belief network to compute the probability for each hidden variable. They believe several hidden units are highly correlated in the pre-dropout activities. DropConnect, proposed by Wan et al. [22], sets a randomly selected subset of weights, rather than activations, to zero. To speed up the training process in Dropout, Wang et al. [23] propose fast dropout. The model uses an objective function approximately equivalent to that of real standard dropout training, but does not actually sample the inputs. sDropout is an extension of Dropout and can be easily combined with the methods mentioned above without conflict.

In spite of recent progress in feature extraction from images [2], representing complex invariance of images in different learning systems is still a challenging task and attracts much work. In unsupervised deep learning systems, Zou et al. [26] utilize slowness and non-degeneracy principles, which can also be applied to still images. Le [11] proposes an autoencoder to recognize faces. The feature detector is reported to be robust to translation. Zeiler et al. [25] present a hierarchical model using alternating layers of convolutional sparse coding and 3D max-pooling to learn image decompositions. In object recognition, a new image is decomposed into multiple layers of features by the learned model and then recognized by a classifier. Ngiam et al. [16] propose a tiled convolutional neural network that does not require adjacent hidden units to share identical weights, but only needs hidden units k steps away from each other to have tied weights. With a TICA pretraining strategy, the architecture is able to learn scale and rotation invariance. Wavelet scattering networks, introduced in [13, 14], build translation-invariant image representations with average pooling of wavelet modulus coefficients. The follow-on work by Sifre et al. [19] proposes a joint translation- and rotation-invariant representation of image patches, which is implemented with a deep convolution network. The filters of the deep convolution network are not learned but are scaled and rotated wavelets.

A more direct way to add invariance is data augmentation. Sermanet et al. [18] build a jittered dataset by adding transformed versions of the original training set, including translation, scaling and rotation. Many more works transform input data to obtain invariance, e.g. Howard [7] and Dosovitskiy et al. [3]. But augmenting the training set with rotated versions does not achieve the same effect as ours, as it cannot be restricted to upper layers. The rotation invariance in our work is achieved by pooling over systematically transformed versions of filters. It is closely related to the recent work of Gens et al. [4], which pools features over symmetry groups within a neural network.

3. Approach 

The sDropout model is an extension of Dropout that keeps the same sampling process in fully-connected layers as Dropout does, but takes the thrown-away units as an extra input to the next layer. We then discuss the rotational convolution techniques: rotate-pooling convolution (RPC) and flip-rotate-pooling convolution (FRPC). These two techniques extract rotation-invariant features through the rotated and flipped filters in convolution layers combined with max-pooling.
3.1. Split Dropout

Dropout was proposed by Hinton et al. [6] as a regularization technique to prevent deep neural nets with a large number of parameters from overfitting. The key idea is to randomly drop units with probability p from the network during training, which breaks the co-adaptation among units. The surviving units make up a thinned network. A neural net with n units can be seen as a collection of $2^n$ possible thinned neural nets, so training a neural network with dropout can be seen as training a collection of $2^n$ thinned networks. At test time, a single neural net without dropout is used, with the weights multiplied by the probability p. In the experiments of this paper, p is set to 0.5. This strategy achieves the effect of averaging exponentially many thinned models. As a result, the performance of neural nets with Dropout is significantly improved.

sDropout is an extension of Dropout. As in Dropout's random sampling stage, sDropout randomly samples the input units, which breaks the co-adaptation among units and thus prevents the model from overfitting. The key difference is that the units failing to survive the sampling stage are not discarded but fed into an extra thinned network that shares the same weights with the first sub-network. Consequently, the inputs are split and trained in two sub-networks. The modification brings two benefits. On one hand, keeping all the units guarantees that all the weights are trained in each iteration, while only half (p is set to 0.5) of the weights are updated in Dropout. As a result, this strategy accelerates convergence. On the other hand, averaging over the same network with different inputs helps improve performance. Notably, sDropout inherits several advantages of Dropout: (1) it breaks co-adaptation among units and thus prevents overfitting; (2) it approximately combines exponentially many different neural network architectures efficiently.

To illustrate our sDropout model, we consider a fully-connected layer l of a neural network. Let $z^{(l)} = [z_1, z_2, \ldots, z_n]^T$ be the inputs of layer l and $W^{(l)}$ (of size $d \times n$) be the weight parameters (biases are included in $W^{(l)}$ with a corresponding fixed input of 1 for simplicity). The outputs of layer l are denoted by $y^{(l)} = [y_1, y_2, \ldots, y_d]^T$, computed by multiplying the input vector by the weight matrix, followed by a non-linear activation function a. The feed-forward process of standard Dropout can be described as follows:

$$y^{(l)} = a(W^{(l)} z^{(l)}), \qquad z^{(l+1)} = m * y^{(l)}, \qquad (1)$$

where * denotes the element-wise product and m is a binary vector of size d whose elements are independent Bernoulli random variables, $m_j \sim \mathrm{Bernoulli}(p)$. Each element of a layer's outputs is kept with probability p and otherwise set to 0 with probability (1 - p). Thus, the outputs of layer l, i.e. $y^{(l)}$, are multiplied by the random sample vector m to obtain thinned outputs that are used as inputs to the next layer l + 1. Many experiments and theoretical analyses have indicated that Dropout improves the network's generalization ability and test performance. However, only a fraction p of the weights is updated in each iteration, which slows down the training procedure.
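
As a rough illustration of the feed-forward step just described, the following NumPy sketch applies one fully-connected layer and then a Bernoulli(p) mask to its outputs. The function name `dropout_forward`, the use of `tanh` as the activation a, and the toy shapes are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
import numpy as np

def dropout_forward(W, z, p=0.5, rng=None, activation=np.tanh):
    """One fully-connected layer followed by standard Dropout (training mode).

    W: (d, n) weight matrix, z: (n,) input vector.
    Returns the thinned outputs m * y and the sampled mask m.
    """
    rng = np.random.default_rng() if rng is None else rng
    y = activation(W @ z)                  # y = a(W z)
    m = rng.binomial(1, p, size=y.shape)   # m_j ~ Bernoulli(p); 1 keeps a unit
    return m * y, m                        # element-wise product m * y
```

Units whose mask entry is 0 contribute nothing to the next layer in that iteration, which is why only part of the next layer's weights receives a gradient.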

The proposed sDropout differs from Dropout in that the outputs of layer l, $y^{(l)}$, are split into two parts, $y_1^{(l)}$ and $y_2^{(l)}$, through a random sample vector m. They are then taken as two different inputs for the subsequent weight connection and activation. This process picks up the dropped outputs $(1_e - m) * y^{(l)}$ as another input for layer l + 1. The feed-forward process with sDropout is described in equation (2), where $1_e$ denotes a vector of size d with all elements being 1. Figure 1 illustrates the procedure of sDropout in one fully-connected layer.
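
The split described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the helper names `sdropout_split` and `sdropout_next_layer`, the `tanh` activation and the shapes are hypothetical, and the two parts are taken to be m * y and (1_e - m) * y as the text indicates.

```python
import numpy as np

def sdropout_split(y, p=0.5, rng=None):
    """Split layer outputs y into two complementary thinned vectors."""
    rng = np.random.default_rng() if rng is None else rng
    m = rng.binomial(1, p, size=y.shape)   # same sampling step as Dropout
    y1 = m * y                             # units kept by the mask
    y2 = (1 - m) * y                       # units standard Dropout would discard
    return y1, y2

def sdropout_next_layer(W_next, y, p=0.5, rng=None, activation=np.tanh):
    """Feed both parts through layer l + 1 using the same weights W_next."""
    y1, y2 = sdropout_split(y, p, rng)
    return activation(W_next @ y1), activation(W_next @ y2)
```

Because the two parts are complementary, every column of `W_next` sees a non-zero input in exactly one of the two sub-networks, so every weight can receive a gradient in each iteration.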
Figure 1. An example of a single sDropout layer. The output of layer l is randomly split into two parts for layer l + 1, both of which go through weight connection and activation. Finally, we get two outputs of layer l + 1. (Best viewed in the electronic copy.)
The learning process is the same as that of standard Dropout, using stochastic gradient descent (SGD). If we take the sampled network in standard Dropout as the sub-network, the network reconstructed from the dropped outputs can be seen as an extra sub-network. Forward and backward propagation are done in both the sub-network and the extra sub-network. The cost function of the sDropout model is computed by combining the two. In particular, let $f(x; \theta, m)$ denote the cost function of the Dropout model, given data x, the parameters $\theta = W$ and the randomly drawn sample mask m. As Dropout is a stochastic regularization technique, the correct value of the cost function is obtained by summing over all possible sample masks m. Let $L(W)$ and $L_S(W)$ denote the cost of Dropout and sDropout respectively. We can see that the two costs are equal:


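$$L(W) = E_m[f(x; \theta, m)] = \sum_m p(m) f(x; \theta, m). \qquad (3)$$

$$\begin{aligned}
L_S(W) &= E_m[(f(x; \theta, m) + f(x; \theta, 1_e - m))/2] \\
&= \sum_m p(m) [(f(x; \theta, m) + f(x; \theta, 1_e - m))/2] \\
&= \sum_m p(m) f(x; \theta, m) \\
&= L(W).
\end{aligned} \qquad (4)$$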
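
The equality of the two costs can be checked numerically on a toy problem by summing over all masks explicitly. In the sketch below, `expected_costs` and `toy_cost` are hypothetical names and the cost is an arbitrary function of the mask, not the network loss. The check also makes one step explicit: passing from the second to the third line of equation (4) uses p(m) = p(1_e - m), which holds for p = 0.5, the setting used in this paper.

```python
import itertools
import numpy as np

def expected_costs(f, d, p=0.5):
    """Return (L, L_S) by summing a toy cost f(mask) over all 2**d masks."""
    L, L_S = 0.0, 0.0
    for bits in itertools.product([0, 1], repeat=d):
        m = np.array(bits)
        # p(m) for d independent Bernoulli(p) entries
        prob = p ** m.sum() * (1 - p) ** (d - m.sum())
        L += prob * f(m)                          # Dropout cost, eq. (3)
        L_S += prob * 0.5 * (f(m) + f(1 - m))     # sDropout cost, eq. (4)
    return L, L_S

def toy_cost(m):
    """Hypothetical stand-in for f(x; theta, m)."""
    return float(np.dot(m, [1.0, 4.0, 9.0]) ** 2)
```

With p = 0.5 the two returned values coincide, while for another value such as p = 0.3 they generally differ for an asymmetric cost like this one.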

The gradient for each parameter is averaged over the training cases in each mini-batch. In each iteration, all the parameters are updated, while the fraction of weights updated in standard Dropout is p. We can expect convergence to be faster, as the weights are updated more frequently. In addition, we get a better estimate of the gradient through the split sub-networks, and sDropout performs better than Dropout. sDropout still enjoys the benefit of regularization since inputs are randomly split into two groups of samples, which breaks the complex co-adaptation among units.

At test time, as in Dropout, we use a single network without sDropout to approximate the average of the exponentially many thinned models. The weights used are scaled down by multiplying them by p. Like standard Dropout, sDropout makes it possible to train a huge number of different networks in a reasonable time.

sDropout can be applied to multiple fully-connected layers in the same way as to a single layer. Since every sDropout layer splits the network into two, applying sDropout in n layers yields $2^n$ sub-networks. Forward and backward propagation are done in these $2^n$ sub-networks. Since fully-connected layers take little time compared to convolution layers, the added computing time is small.
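
The doubling of sub-networks per layer can be illustrated with a short NumPy sketch. The function `sdropout_stack`, the `tanh` activation and the toy layer sizes are assumptions for illustration; the sketch only shows how n stacked sDropout layers lead to $2^n$ thinned branches with shared weights.

```python
import numpy as np

def sdropout_stack(weights, z, p=0.5, rng=None, activation=np.tanh):
    """Forward pass through stacked sDropout layers.

    weights: list of weight matrices, one per fully-connected layer.
    Returns the list of thinned outputs after the last layer,
    one per sub-network (2**len(weights) in total).
    """
    rng = np.random.default_rng() if rng is None else rng
    branches = [z]
    for W in weights:                       # the same W is shared by all branches
        next_branches = []
        for v in branches:
            y = activation(W @ v)
            m = rng.binomial(1, p, size=y.shape)
            next_branches += [m * y, (1 - m) * y]   # each branch splits in two
        branches = next_branches
    return branches
```

With two fully-connected layers, as in fc1 and fc2, this gives four branches per training sample.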

3.2. Rotate-Pooling Convolution 

The features automatically learned from images by CNNs are meaningful [24]. The convolution layers extract features from low level to high level, and invariance becomes greater in higher layers. Although data augmentation by adding translated, rotated or scaled data can bring invariance to some extent, it cannot be restricted to upper layers to obtain high-level invariance. Thanks to the local receptive fields, weight-tying and pooling strategy of CNNs, translation invariance is hard-coded into CNNs. However, this prevents the pooling units from capturing rotation invariance [16]. The work of Zeiler and Fergus [24] visualizes the features captured from the top and bottom layers in CNNs and validates that the network output is not invariant to rotation, except for objects with rotational symmetry. However, rotation invariance is crucial to feature extraction.

To address this problem, we propose two rotational convolution techniques, i.e. rotate-pooling convolution (RPC) and flip-rotate-pooling convolution (FRPC). Similar to the way translation invariance is encoded into the convolution layers, we introduce rotated filters and use local receptive fields, weight-tying and pooling strategies to encode rotation invariance into the convolution layers of CNNs. The convolution layers in the original networks are replaced by RPC and the other stages are kept unchanged.

We rotate the convolution filters and apply max-pooling to the output feature maps. Specifically, the convolution filter is rotated by 0, 45, 90, ..., 315 degrees in-plane, giving 8 filters that share the same weights. We then convolve the input feature map with these 8 filters and get 8 outputs, which can be viewed as an 8-channel map. Finally, we apply max-pooling across the 8 channels to get a final feature map as one output of the layer. We call this strategy rotate-pooling convolution (RPC). CNNs with RPC enjoy two benefits: (1) they are able to capture rotated features, and (2) they have no extra parameters to train. Since RPC is only applied to the last few layers, it adds only a little extra computation time.
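
A minimal single-channel sketch of this procedure is given below. It rests on assumptions that the text does not state: a 3 x 3 filter is rotated by multiples of 45 degrees by cyclically shifting its 8 border cells around the fixed center, the convolution is a plain "valid" cross-correlation, and the names `rotate3x3`, `correlate_valid` and `rpc` are hypothetical. The real layers operate on multi-channel filters (3 x 3 x 256 in conv3_1).

```python
import numpy as np

# Clockwise order of the 8 border cells of a 3x3 filter.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

def rotate3x3(w, k):
    """Rotate a 3x3 filter clockwise by k * 45 degrees (assumed ring-shift scheme)."""
    out = w.copy()                       # the center cell stays in place
    vals = [w[r, c] for r, c in RING]
    for i, (r, c) in enumerate(RING):
        out[r, c] = vals[(i - k) % 8]    # each border value moves k cells clockwise
    return out

def correlate_valid(x, w):
    """'Valid' 2D cross-correlation of a single-channel map x with filter w."""
    kh, kw = w.shape
    rows, cols = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def rpc(x, w):
    """Rotate-pooling convolution: element-wise max over the 8 rotated responses."""
    maps = np.stack([correlate_valid(x, rotate3x3(w, k)) for k in range(8)])
    return maps.max(axis=0)
```

Under this scheme, rotating the input by 90 degrees simply rotates the pooled output, because the set of 8 filters is closed under 90-degree rotation; all 8 responses are computed from one set of weights, so no parameters are added.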

In this paper, we intend to extract rotation invariance at high levels and thus apply RPC in the high layers conv3_1, conv3_2 and conv3_3 of the CNN architecture, which is described in Section 4. For example, Figure 2 shows the rotate-pooling convolution in layer conv3_1. To train the model, the feed-forward process is done as described previously and the backpropagation process is the same as in a traditional max-pooling layer. Note that our model with RPC does not change the network architecture or training process, and can be easily applied to the convolution layers of other systems based on convolutional neural networks.

We let a fraction r of the convolution filters undergo rotation, and the rest remain unchanged. We do not apply RPC to all the filters in each layer because the orientations of objects are sometimes useful for classification and recognition. The parameter r is tuned during training. Our experimental results show that r = 50% is better than r = 100%.
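
One simple way to pick which filters undergo rotation is sketched below. The text does not say how the rotated filters are chosen, so uniform random selection made once per layer is an assumption here, and `choose_rotated_filters` is a hypothetical helper.

```python
import numpy as np

def choose_rotated_filters(num_filters, r, rng=None):
    """Return a boolean mask marking the fraction r of filters that use RPC."""
    rng = np.random.default_rng() if rng is None else rng
    k = int(round(r * num_filters))                 # number of filters to rotate
    mask = np.zeros(num_filters, dtype=bool)
    mask[rng.choice(num_filters, size=k, replace=False)] = True
    return mask
```

For a layer with 384 filters and r = 50%, the mask marks 192 filters for rotate-pooling and leaves the other 192 as ordinary convolution filters.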
Figure 2. The rotate-pooling convolution in layer conv3_1. The first row is one input of layer conv3_1, a 13 x 13 x 256 feature map. The second row shows the 8 rotated 3 x 3 x 256 filters sharing the same weights. By convolving the input feature map with the 8 filters, we get 8 different feature maps in the third row (in blue). Finally, through a max-pooling procedure that is element-wise over the 8 feature maps, we get one output feature map of layer conv3_1. (Best viewed in the electronic copy.)


We use FRPC to abbreviate the new convolution neuron, flip-rotate-pooling convolution, in the following sections.

In the experiments, we also do not let all the filters undergo flipping. The ratio of filters flipped left-right and up-down is set by experimental adjustment, and the selection of filters is random.

4. Implementation Details 

We use the standard fully supervised convnet model of Zeiler and Fergus [24] to validate our improvements on CNNs. Table 1 below shows the 8-layer architecture of the model used in our experiments. The model settings are the same as in [24] except that we use sDropout in the fully-connected layers fc1 and fc2, and FRPC or RPC in conv3_1, conv3_2 and conv3_3. In the rotational convolution layers, the percentage of filters to be rotated and flipped in each layer is set to r = 100%, r = 50%, or r = 25%. Note that this process is the same for every sample.

Our code is based on the Cuda-convnet package (footnote 1). We adopt "data parallelism" [8] on the convolution layers. The GPUs are synchronized before the first fully-connected layer. We use two GPUs, with 64 image samples on each in every iteration.


| layer   | size         |
|---------|--------------|
| conv1   | 7x7, 64, /2  |
| pool1   | 3x3, /2      |
| conv2   | 5x5, 256, /2 |
| pool2   | 3x3, /2      |
| conv3_1 | 3x3, 384     |
| conv3_2 | 3x3, 384     |
| conv3_3 | 3x3, 256     |
| pool3   | 3x3, /2      |
| fc1     | 4096         |
| fc2     | 4096         |
| fc3     | 1000         |

Table 1. Architecture of the 8-layer convnet model we use in this paper.


3.3. Flip-Rotate-Pooling Convolution 

Since we introduce rotation to the convolution filters, it is natural to consider flipped filters as well. In fact, symmetric patterns can be observed in images, and human vision systems can recognize rotated or flipped objects without any problem. Here, we modify the neurons in convolution layers by adding flipped filters. Similar to RPC, the selected filter is flipped up-down or left-right to produce one more filter sharing the same parameters. Then the input feature map is convolved with the 2 filters (1 original and 1 flipped) to produce a 2-channel feature map. As in RPC, max-pooling across the 2 channels is applied and we get one output of the layer.
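
The flip step can be sketched for a single-channel feature map as follows. The names `correlate_valid` and `flip_pool` and the `direction` argument are hypothetical, and the sketch covers only the flip-and-pool operation described in this section, not its combination with rotated filters.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate_valid(x, w):
    """'Valid' 2D cross-correlation of a single-channel map x with filter w."""
    windows = sliding_window_view(np.ascontiguousarray(x), w.shape)  # (H', W', kh, kw)
    return np.einsum("ijkl,kl->ij", windows, w)

def flip_pool(x, w, direction="left-right"):
    """Max over the responses of a filter and its flipped copy (shared weights)."""
    flipped = np.fliplr(w) if direction == "left-right" else np.flipud(w)
    return np.maximum(correlate_valid(x, w), correlate_valid(x, flipped))
```

Flipping the input in the same direction as the filter only flips the pooled output, which is the invariance the extra filter is meant to provide.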


5. Experiments 

The proposed sDropout and the rotational convolution techniques are evaluated on the 1000-class ImageNet 2012 [17] classification task. The ImageNet Large Scale Visual Recognition Challenge [17] is a venue for evaluating the most effective image classification and recognition methods, and CNNs have demonstrated a large step forward on ImageNet 2012 since [6, 9]. We use the provided 1.2 million training images for training. All the trained models are evaluated on the 50k validation images with published labels, and results are measured by top-1 and top-5 error rates [17].

1 https://code.google.com/p/cuda-convnet/

We use the following protocol for all experiments unless otherwise stated. For data preparation, each RGB image is preprocessed by resizing the smaller dimension to 256, cropping the center 256 x 256 region, subtracting the per-pixel mean and then taking one randomly cropped one-view sub-region of size 224 x 224. Data augmentation is not used. For parameter settings, we use mini-batch SGD on batches of 128 images with a learning rate of 0.2 and the momentum parameter fixed at 0.9. All weights are initialized at 0.01 and biases are set to 1.
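
The cropping and mean-subtraction steps of this protocol can be sketched in NumPy as below. The sketch assumes the image has already been resized so that its smaller side is 256 (the resizing itself needs an image library and is omitted), and the function names `center_crop` and `random_view` are hypothetical.

```python
import numpy as np

def center_crop(img, size=256):
    """Crop the central size x size region of an (H, W, 3) image."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def random_view(img, mean, crop=224, rng=None):
    """Center-crop, subtract the per-pixel mean, then take one random crop.

    img: (H, W, 3) array whose smaller side is already 256.
    mean: (256, 256, 3) per-pixel mean image of the training set.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = center_crop(img, mean.shape[0]).astype(np.float32) - mean
    top = rng.integers(0, x.shape[0] - crop + 1)    # upper bound is exclusive
    left = rng.integers(0, x.shape[1] - crop + 1)
    return x[top:top + crop, left:left + crop]
```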

We show the error rate decreases of sDropout, RPC and FRPC on the CNN model with activation function ReLU [15] or PReLU [5]. PReLU was proposed by He et al. [5] and exhibits better performance in deep neural networks than the ReLU used in [24]. We adopt PReLU in our modifications to make the model more powerful. For convenience, let Z and ZP denote the model of Zeiler and Fergus [24] with ReLU and PReLU respectively. Similarly, ZPS, ZPSR and ZPSRF denote the ZP model with {sDropout}, {sDropout, RPC} and {sDropout, FRPC} respectively; ZS and ZSRF are built from Z in the same way. Overall, we observe a 1.18% error rate decline (10-view, top-1) on model ZPSRF compared to ZP.

5.1. sDropout v.s. Dropout 

We conducted comparisons between sDropout and standard Dropout using models Z and ZP. We apply Dropout and sDropout in the two fully-connected layers fc1 and fc2. Figure 3 shows the convergence rate of model Z (with learning rate set to 0.1) and ZS (with learning rate set to 0.2) in training and testing. From the curves in the figure, we can see that model ZS converges faster than model Z in both training and testing. In addition, the top-1 error rate of model ZS is 0.53% lower than that of Z. On the validation dataset, the top-1 and top-5 error rates of ZS (with learning rate set to 0.2) are 37.19% and 15.69%, while the top-1 and top-5 error rates of Z (with learning rate set to 0.2) are 37.72% and 15.81%, using 10-view testing. Besides the accuracy improvement, sDropout takes little extra computing time. In Figure 4, the training time is compared between the Z and ZS models.

5.2. RPC & FRPC 

Here, we analyse the performance of model ZPSR on the 50k ImageNet validation dataset. These images are rotated by various angles from 0° to 360° to get 64 views. We set the rotate percentage r = 100% in this experiment, i.e. all of the filters in conv3_1, conv3_2 and conv3_3 are rotated. Figure 5 shows the average classification accuracy (top-1) of models ZP, ZPS and ZPSR over different rotation degrees. As the image rotation degree increases, the accuracy of the three models drops dramatically, which indicates that CNNs are susceptible to rotation transformation. Due to the rotational symmetry of some images, peaks of the accuracy curve appear at rotation degrees of 90°, 180° and 270°. Model ZPSR always outperforms the other two, especially at around 180°. Besides, the average accuracy of model ZPS is slightly higher than that of ZP. This validates that rotate-pooling convolution improves the robustness of CNNs against rotation transformations.

Figure 3. The convergence of model Z with Dropout and sDropout. The x-axis is the number of training and testing epochs. The y-axis is the 10-view top-1 error. The dotted line corresponds to the testing error rate on validation samples and the solid line corresponds to the training error rate. Model ZS with a learning rate of 0.2 (in red) converges faster than model Z with a learning rate of 0.2 on validation samples. In addition, model ZS has a slightly lower error rate than model Z. (Best viewed in the electronic copy.)

Figure 4. The training time of model Z with Dropout and sDropout. The x-axis is the training time. The y-axis is the 10-view top-1 error. The solid line (in red) corresponds to the training error of model Z with Dropout and the dotted line (in blue) corresponds to model Z with sDropout. Model ZS takes little extra time compared to model Z. (Best viewed in the electronic copy.)

Since the classification accuracy is averaged over 50k images, we would like to explore the effect of RPC on different kinds of images. We pick some images in the validation dataset that both ZPS and ZPSR can classify accurately. The classification results are compared through the probability of the true label against the rotation degree. Unsurprisingly, the degree of enhancement varies across different kinds of images. In general, the probability of the true label increases to some extent when using rotate-pooling. To make this clear, we pick some typical images that behave differently under rotation. As Figure 6 shows, the images in the top, middle and bottom groups are the typical images, and their corresponding accuracy curves against different rotation degrees are shown below them. As we can see, the images in the top row have some robustness to rotation by 90°, 180° and 270° in both models. The accuracy curve of model ZPS has four peaks at these degrees, while model ZPSR also performs better at other degrees. The images in the middle are classified accurately only at small rotation degrees by model ZPS, while model ZPSR gives a high probability at most degrees. The images at the bottom are classified with low probability by model ZPS at the original orientation and at other degrees. For these images, model ZPSR gives a much higher probability at most degrees.



Figure 5. Comparison between ZP, ZPS and ZPSR on classification of the rotated testing dataset. The curves are the average accuracy (top-1) on 50k images over the rotation degrees. Model ZPSR (in green) outperforms the other two models, especially at around 180°. (Best viewed in the electronic copy.)

5.3. CNNs performance comparison 

Finally, we compare the performance of models with sDropout or flip-rotate-pooling with different selected ratios on ImageNet 2012 classification [17]. These models, which share the same CNN architecture (described in Table 1) and parameter settings, are (1) Z; (2) ZS; (3) ZSRF (quarter) (rotate percentage r = 25%, flip percentage r = 25%); (4) ZP; (5) ZPS; (6) ZPSR (full) (rotate percentage r = 100%); (7) ZPSR (half) (rotate percentage r = 50%); (8) ZPSRF (quarter) (rotate percentage r = 25%, flip percentage r = 25%). Testing on the 50k validation dataset, we use the 10-view top-1 and top-5 error rates to evaluate the classification results. The results are compared in Table 2 below. With sDropout, the top-1 error rate of Z improves slightly from 37.72% to 37.19%, and that of ZP from 37.15% to 36.51%. This improvement is obtained at almost no computational cost. Combined with sDropout, model ZPSRF with a 25% rotate and flip ratio has the lowest error rate among the compared models on the testing data. Together, sDropout and the flip-rotate-pooling convolution technique bring a 1.18% accuracy increase to the ZP model.

Figure 6. Analysis of rotation invariance within models ZPS and ZPSR. Rows 1, 3 and 5: example images undergoing rotation transformations. Rows 2, 4 and 6: the probability of the true label for each image against the rotation degree. (Best viewed in the electronic copy.)

| Model                                      | Val top-1 error (%) | Val top-5 error (%) |
|--------------------------------------------|---------------------|---------------------|
| Our replication of Zeiler and Fergus [24]  | 37.72               | 15.81               |
| ZS                                         | 37.19               | 15.69               |
| ZSRF (quarter)                             | 36.60               | 15.28               |
| ZP (Zeiler and Fergus [24] with PReLU [5]) | 37.15               | 15.62               |
| ZPS                                        | 36.51               | 15.40               |
| ZPSR (full)                                | 36.22               | 15.23               |
| ZPSR (half)                                | 36.08               | 15.10               |
| ZPSRF (quarter)                            | 35.97               | 15.06               |

Table 2. ImageNet 2012 classification error rates.
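
The gains quoted in the text follow directly from the top-1 column of Table 2, as the short check below shows. The dictionary keys are shorthand labels for the table rows and `gain` is a hypothetical helper.

```python
# Top-1 validation error rates (%) copied from Table 2.
top1 = {
    "Z": 37.72,
    "ZS": 37.19,
    "ZSRF (quarter)": 36.60,
    "ZP": 37.15,
    "ZPS": 36.51,
    "ZPSR (full)": 36.22,
    "ZPSR (half)": 36.08,
    "ZPSRF (quarter)": 35.97,
}

def gain(baseline, model):
    """Absolute top-1 error reduction of `model` relative to `baseline`, in points."""
    return round(top1[baseline] - top1[model], 2)

print(gain("ZP", "ZPSRF (quarter)"))  # sDropout + FRPC on top of ZP
print(gain("Z", "ZS"))                # sDropout alone on top of Z
print(gain("ZP", "ZPS"))              # sDropout alone on top of ZP
```

These differences are absolute percentage points of top-1 error, which is how the 1.18% and 0.53% figures in the text should be read.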

6. Discussion and Conclusion 

It is shown that introducing rotate-pooling convolution and flip-rotate-pooling convolution into the convolution layers is effective for improving the classification performance of CNNs. Specifically, we find that FRPC with 25% rotate and flip ratios, combined with sDropout, improves the model most. In addition, sDropout makes the network converge faster and slightly increases testing accuracy.

In this paper, we introduce Split Dropout, an extension of Dropout, to make CNNs more efficient and effective on image classification. In addition, we propose rotate-pooling convolution and flip-rotate-pooling convolution to make the models robust to rotation transformations. These techniques bring considerable improvements on the classification task, at very little extra memory or computing cost. It is worth mentioning that our formulation is compatible with various recently proposed techniques such as multi-model averaging and DropConnect. The model has high potential to be applied in more sophisticated networks to achieve better performance.